Can Star Wars-related survey responses be used to predict whether a person earns more than $50,000 annually??
Data Loading & Missing Value Analysis
Show the code
# First 2 rows are column headers, created a map to make it easy to readnew_col_names = ["id", "seen_any", "fan_starwars", "seen_Ep_I", "seen_Ep_II", "seen_Ep_III", "seen_Ep_IV", "seen_Ep_V", "seen_Ep_VI", "rank_Ep_I", "rank_Ep_II", "rank_Ep_III", "rank_Ep_IV", "rank_Ep_V", "rank_Ep_VI", "fav_han", "fav_luke", "fav_leia", "fav_anakin", "fav_obi", "fav_palpatine", "fav_darth", "fav_lando", "fav_boba", "fav_c3po", "fav_r2", "fav_jar", "fav_padme", "fav_yoda", "who_shot_first", "familiar_expanded_universe", "fan_expanded_universe", "fan_startrek", "gender", "age", "income", "educ", "region"] # Read in the data to a pandas data-frame from CSV githubdf = pd.read_csv("https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv", encoding_errors="ignore", names = new_col_names, skiprows =2)# Replace common missing indicators with np.nandf = df.replace(['NA', 'N/A', 'na', 'n/a', '', 'NaN', 'nan', 999, -99, None], np.nan)
The dataset now loads cleanly into a DataFrame with 38 columns covering Star Wars movie viewership, rankings, favorite characters, and demographic information (gender, age, income, education, region).
From here, I’ll inspect the structure, data types, and missing values to prompt the direction of this project.
Missing Value Analysis
The columns with the highest percentage of missing values reveal clear patterns in survey behavior:
Show the code
# Get data types for internal usedatatypes = df.dtypes# Percent missing values of each columnmissing = (df.isna().sum() /len(df) *100).round(1)display(f"Top 10 Missing Values Percentages of Columns")display(missing.sort_values(ascending=False).head(10)) # Show top 10
Because of the extreme problems of missing values in columns, I will now filter on respondents who have at least seen one movie. I will also drop the fan_expanded_universe column as no insights can come from a column of mostly missing values.
Show the code
# Filtered to only include those who have inputted a movie they have seen, don't use seen_any column as many of those who answered 'yes', also left all of those blank when listing the movies they have seendf = df.dropna(subset=df.columns[2:8], how='all')# Remove all rows where income (target variable) is NAdf = df.dropna(subset=['income'])# Drop id columnsdf = df.dropna(subset=['id'])# Drop fan_expanded_universe columndf = df.drop(['fan_expanded_universe'], axis=1)# Select columns 3–8cols = df.columns[2:9]# Replace NaN with "no", everything else with "yes"df[cols] = df[cols].applymap(lambda x: 'yes'if pd.notna(x) else'no')
After the filtering out those who have not seen any star wars movie and those who did not list their income level, the total number of respondents went from 1186, down to 675. This is because it neglects the purpose of the project, which is to use star wars respondents to predict income level. It would not make sense to use peoples responses who have not at least seen one movie nor have listed their income.
One issue I forsee happening is the data is getting smaller and smaller, but still a a decent size. I essentially cut 57% of the respondents out from the survey.
Columns regarding on if the respondent is a fan, or if they have seen certain movies, also have a lot of NA’s. Further investigation reveals they only have one option to choose from, thus I will interpret blank answers as being they have not seen that movie.
Show the code
# Percent missing values of each column, filtered by those have have seen one movie at leastmissing = (df.isna().sum() /len(df) *100).round(1)display(f"Top 10 Missing Values Percentages of Columns")display(missing.sort_values(ascending=False).head(10)) # Show top 10
As you can now see, the data set is a lot cleaner when it comes to missing values.
The next step will be to examine the nature of the target variable and key predictors in order to determine which model(s) will be best suited for this project.
Exploratory Data Analysis
Show the code
# Grab value counts and Get proportion of column it representsincome_counts = df['income'].value_counts(dropna=False)income_props = df['income'].value_counts(normalize=True, dropna=False).round(2)# Display the counts and proportions as a tabledisplay(pd.DataFrame({'Count': income_counts, 'Proportion': income_props}))# Order income for graph mapincome_order = ['$0 - $24,999', '$25,000 - $49,999', '$50,000 - $99,999','$100,000 - $149,999','$150,000+']# Convert to ordered categoricaldf['income'] = pd.Categorical(df['income'], categories=income_order, ordered=True)display(ggplot(df, aes(x='income')) +\ geom_bar(fill="#4C72B0") +\ ggtitle("Income Distribution") +\ xlab("Income Category") +\ ylab("Number of Respondents") + theme(axis_text_x=element_text(angle=30, hjust=1)))
Count
Proportion
income
$50,000 - $99,999
238
0.35
$25,000 - $49,999
147
0.22
$100,000 - $149,999
115
0.17
$0 - $24,999
98
0.15
$150,000+
77
0.11
The income variable contains six categories. with the distribution showing two important points:
The 50,000 - 99,999 income level is decently biased, with about 35% of the respondents falling in. The rest of the income levels, range in between 11-22%. The variation of levels is slightly imbalanced, but we should be able to tweak some things depending on the model route I choose. Oddly enough, we see a normal distribution of income levels across the dataset, which could prove useful for statistical analysis later on.
Although this is in the incorrect format for analysis, I will bin on 50k or more. We end up getting 64% of answers being 50k or more. This imbalance could show up in recall and f1 scores later on if ML is the route.
Show the code
# Convert Income ranges to values 0 for less than 50kincome_map = {'$0 - $24,999': 0,'$25,000 - $49,999': 0,'$50,000 - $99,999': 1,'$100,000 - $149,999': 1,'$150,000+': 1}# Replacedf.loc[:,'income'] = df.loc[:,'income'].replace(income_map)# Grab value counts and Get proportion of column it representsincome_counts = df['income'].value_counts(dropna=False)income_props = df['income'].value_counts(normalize=True, dropna=False).round(2)# Display the counts and proportions as a tabledisplay(pd.DataFrame({'Count': income_counts, 'Proportion': income_props}))
Count
Proportion
income
1
430
0.64
0
245
0.36
I will now move on to converting string-ordinal columns, to numeric-ordinal column for the purpose of visualizing, finding correlations, and preparing for modeling.
Show the code
# Convert age to numeric, picking the median of each, 66 for >60 because 72 is average life expectancyage_map = {'18-29': 23, '30-44': 37, '45-60': 52.5, '> 60': 66}# Replacedf['age'] = df['age'].map(age_map)# Preserve order in fav columnsfav_order = ['Very unfavorably','Somewhat unfavorably','Neither favorably nor unfavorably (neutral)','Unfamiliar (N/A)','Somewhat favorably','Very favorably']for col in df.columns[15:29]: df[col] = pd.Categorical(df[col], categories=fav_order, ordered=True)# Convert fav_ columns to number rankfav_map = {'Very favorably': 2,'Somewhat favorably': 1,'Neither favorably nor unfavorably (neutral)': 0,'Unfamiliar (N/A)': 0,'Somewhat unfavorably': -1,'Very unfavorably': -2}# Replacedf.iloc[:, list(range(15,29))] = df.iloc[:, list(range(15,29))].replace(fav_map)
Show the code
# Select favorability columns by name pattern and change dtype for correlation laterrank_cols = [col for col in df.columns if'rank_'in col]df[rank_cols] = df[rank_cols].astype('Int64')fav_cols = [col for col in df.columns if'fav_'in col]df[fav_cols] = df[fav_cols].astype('Int64')
Show the code
# Select all numeric columnsnumeric_cols = [col for col in df.columns if df[col].dtype in ['int64', 'int8', 'Int64', 'Int8', 'float64']]# Calculate correlations with incomecorrelations = df[numeric_cols + ['income']].corrwith(df['income'])correlations = correlations.drop('income').sort_values(key=abs, ascending=False)print("Top 5 correlations with income (>$50k):")display((correlations.head(5)))print("\nBottom 5 correlations:")display(correlations.tail(5))df['income'] = df['income'].astype('bool')
From the aforementioned correlations, we can see certain characters such as Luke, Leia, Obi-wan, and Han Solo, have around a 12-15% correlation with income.
Age, had somewhat to do as most would have guessed, with a 12% correlation with income.
The main standout from this, is no one variable is worth exploring as being a predictor of income by itself, but rather an interaction of multiple variables might be worthy pursuing.
In the coming graphs, I am looking for differences in the distributions between the levels of 2 variables, to assess if their is a multiple relationship between them and income. I will only display typical choices, and ones that are interesting.
This first graph is emblematic of age distribution by favorability of all characters. No interesting uniqueness occurs between each level.
Show the code
( ggplot(df, aes(x="age", fill="fav_han")) + geom_bar(position="dodge") + facet_wrap("income") + labs( title="Age distribution by 50k Income Level & Han Solo Favorability", x="Age", y="Number of Respondents", fill="Han Solo fan? (2= very, 1= Somewhat" ) + theme(legend_position ="none"))
The levels of age to education in regards to income being greater than 50 or not, as a small emerging difference, once you get the the older age range (>+50). Having a graduate degree climbs the ladder to being most often among the other levels.
Show the code
( ggplot(df, aes(x="age", fill="educ")) + geom_bar(position="dodge") + facet_wrap("income") + labs( title="Star Wars vs Star Trek Fandom by Income", x="Age", y="Count", fill="Luke Fan? (2= very, 1= Somewhat" ) + theme(legend_position='none'))
Interestingly below, the distributions across all levels of gender, age, and 50k income level bring uniqueness to the data.
Show the code
( ggplot(df, aes(x="age", fill="gender")) + geom_bar(position="dodge") + facet_wrap("income") + labs( title="Distribution of The Levels of Age, Gender, & 50k Income Level", x="Age", y="# of Respondents", fill="Gender" ))
Age shows up again with unique distributions, but this time across answers to the famous question, “Who Shot First?”.
Show the code
( ggplot(df, aes(x="age", fill="who_shot_first")) + geom_bar(position="dodge") + facet_wrap("income") + labs( title="Star Wars vs Star Trek Fandom by Income", x="Age", y="Count", fill="Luke Fan? (2= very, 1= Somewhat" ) + theme(legend_position ="none"))
The distribution across the different regions one is from, should’ve been an easy one to guess to have impact of income level. However, the levels of region to gender, was most revealing.
Show the code
top_regions = df['region'].value_counts().nlargest(6).indexdf2 = dfdf2['region_plot'] = np.where(df2['region'].isin(top_regions), df2['region'], 'Other')( ggplot(df2, aes(x="region_plot")) + geom_bar() + coord_flip() + facet_grid(y='income', x='gender') + labs(title="Region Distribution by Income and Gender", x="Region", y="Income"))
There were many more interesting variations of the distributions levels, usually always revolving around 4 variables, gender, age, education level, and region. Almost none of them were involving the star wars related questions except for one mentioned above.
All in all, the distributions are more similar than different across almost all interactions explored. There are too many variables with not unique enough distributions to pursue logistic regression.
Therefore, I will entertain multiple tree models, including Random Forest, K-Nearest Neighbors, Gradient Boosting, Decision Tree, and NB booster. This project is limited by my knowledge, which only includes these ML algorithms as of today.
Model Prep
Show the code
# Rank education levelsed_map = {'High school degree': 2,'Some college or Associate degree': 3,'Bachelor degree': 4,'Graduate degree': 5,'Less than high school degree': 1}df['educ'] = df['educ'].replace(ed_map)# Separate columns to exclude from one-hot encodingencode_mask = df.columns.str.contains(r'fav_|income|educ|age', case=False)new_encode = df.loc[:, ~encode_mask].astype("category")old_encode = df.loc[:, encode_mask]# One-hot encode the remaining categorical/object columnssw_encoded = pd.get_dummies(new_encode, dummy_na=True, prefix_sep='_', drop_first=False, dtype=int)sw_encoded.columns = sw_encoded.columns.str.replace(' ', '_')# Concatenate back the excluded columnssw = pd.concat([df.iloc[:,0], sw_encoded, old_encode], axis=1)# Drop rows with 7 or more nansw = sw[sw.isna().sum(axis=1) <=7]# Drop rows with nan in outcome variablesw = sw.dropna(subset=['income'])# drop ID columnsw = sw.drop('id', axis=1)
To formally state the target variable:
\texttt{income} =
\begin{cases}
0 & \text{if income } < \$50{,}000 \\\\
1 & \text{if income } \geq \$50{,}000
\end{cases}
Machine Learning Model Introduction
After preprocessing, I assigned the predictors to x and the binary target variable to y = income. The dataset was split into training and test sets using a 75/25 ratio. Missing values in x were either filled with 0 (for models that require complete input like GradientBoostingClassifier and GaussianNB) or left as-is if the model could tolerate it.
I trained and evaluated the following classifiers:
RandomForestClassifier
GradientBoostingClassifier
DecisionTreeClassifier
GaussianNB (Naive Bayes)
‘K-Nearest Neighbors’
Each model’s performance was evaluated using accuracy_score and classification_report from sklearn.metrics.
Show the code
# Assign Predictors and outcome variablesx = sw.drop(columns='income')y = sw['income'].astype(int)
Show the code
# Split dataset into training and testing setsx_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=50)# Split and fill na's for gradient/Gassianx_train_na = x_train.fillna(0)x_test_na = x_test.fillna(0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Random Forest achieved the 2nd highest overall accuracy, but had a concerning recall score of .42 for the under 50k income level. Upon inspected the confusion matrix, it’s number of false negatives and false positives were 59.
Gradient Boosting had te best accuracy with roughly 70%. Although, its recall score for under 50k was low as well at .31. In essence, great at predicting the true cases, and terrible at the untrue cases. It’s confusion matrix revealed a total of false positives and false negatives to be 51.
** The Decision Tree ** produced a far more well rounded recall and precision scores, but lost much of it’s accuracy in doing so at 58%.
K-Neighbors achieved not so great results with 65% accuracy, and .27 recall for under 50k. Again, good at predicting the >=50k level, but not for the alternative. ## Conclusion and Recommendation
While Gradient Boosting offered a more balanced performance across both classes, I would not consider it a strong choice for predicting income levels. If someone is to use it at all, use it with caution and knowledge that it does not predict lower income level well, as it predicts the upper income most of the time, possibly due to an imbalanced data set with over representation.
Feature Importance in Gradient Boosting Results
I examined which features contributed most to the the Gradient Boosting predictions.
Using the feature_importances_ attribute of the trained model, I identified the top predictors for determining whether a respondent earns over $50,000. These features reflect a combination of demographic factors and Star Wars-related preferences that the model found most useful for predicting income level.
In the model, feature importance was heavily favored towards 5 features. Among the features, age and educ were most important, which aligns with expected links between demographics and income. Interestingly, how participants viewed Ep 6, had an importance of .155. However, this was just over half of age’s important of .298.
In conclusion, it seems as if star wars related preferences, at least from this dataset, are not very useful in predicting income levels. Perhaps more representation of those under 50k, and better clean data can boost this predictive model.